This exercise reflects what we learned in the Exploratory Data Analysis lesson of Udacity’s Data Analyst Nanodegree. The data relates to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).
The data set is provided by the UC Irvine Machine Learning Repository: [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014.
In this section, we will explore many variables and their distributions. The objective is to gain a general understanding of the data presented in the data set.
## [1] 41188
The data is divided into 21 variables:
## [1] "age" "job" "marital" "education"
## [5] "default" "housing" "loan" "contact"
## [9] "month" "day_of_week" "duration" "campaign"
## [13] "pdays" "previous" "poutcome" "emp.var.rate"
## [17] "cons.price.idx" "cons.conf.idx" "euribor3m" "nr.employed"
## [21] "y"
Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: “admin.”,“blue-collar”,“entrepreneur”,“housemaid”,“management”,“retired”,“self-employed”,“services”,“student”,“technician”,“unemployed”,“unknown”)
3 - marital : marital status (categorical: “divorced”,“married”,“single”,“unknown”; note: “divorced” means divorced or widowed)
4 - education (categorical: “basic.4y”,“basic.6y”,“basic.9y”,“high.school”,“illiterate”,“professional.course”,“university.degree”,“unknown”)
5 - default: has credit in default? (categorical: “no”,“yes”,“unknown”)
6 - housing: has housing loan? (categorical: “no”,“yes”,“unknown”)
7 - loan: has personal loan? (categorical: “no”,“yes”,“unknown”)
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: “cellular”,“telephone”)
9 - month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
10 - day_of_week: last contact day of the week (categorical: “mon”,“tue”,“wed”,“thu”,“fri”)
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=“no”). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: “failure”,“nonexistent”,“success”)
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)
Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: “yes”,“no”)
The data is grouped into 4 ‘types of data’:
The last one, named ‘y’, represents the success/failure of the marketing campaign: “The client has purchased the service marketed”.
We can quickly see that the client data part of the data set consists predominantly of factor variables, except for the ‘age’ variable.
For each variable displayed, the count and proportions of the total are represented next to each other.
Variables in this section are both categorical and numerical.
The following plot represents the campaign variable, described as:
number of contacts performed during this campaign and for this client (numeric, includes last contact)
The main feature of this data set is that it clearly represents the information produced during a business process: a direct marketing campaign. Since it is business driven, it includes a success/failure variable that will be key to analysing the other variables.
Besides the success/failure variable, we are provided with 3 blocks of data:
We will investigate how the success variable relates to those blocks, and whether one block is more strongly correlated with a success event than the others.
Yes, we created a new variable that ‘merges’ the information contained in the housing and loan variables. The objective is to have one variable that aggregates the binary condition of having a loan.
The new variable is called ‘hasloan’. Its value is ‘yes’ if housing == ‘yes’ OR loan == ‘yes’, and ‘no/unknown’ otherwise.
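As a minimal sketch of how such a variable could be derived in base R: the toy data frame `bank` below is made up for illustration, and the level name ‘no/unknown’ and the ordered factor type are taken from the summary output shown later in this report. This is not necessarily the exact code used for the analysis.

```r
# Toy rows standing in for the real data set (values invented for illustration).
bank <- data.frame(
  housing = c("yes", "no", "unknown", "no", "yes"),
  loan    = c("no", "yes", "no", "no", "unknown"),
  stringsAsFactors = FALSE
)

# 'yes' when the client has a housing loan OR a personal loan; everything
# else (including 'unknown') falls into the 'no/unknown' level.
bank$hasloan <- factor(
  ifelse(bank$housing == "yes" | bank$loan == "yes", "yes", "no/unknown"),
  levels  = c("yes", "no/unknown"),
  ordered = TRUE
)

table(bank$hasloan)
```

Note that a client with an ‘unknown’ housing loan and no personal loan ends up in ‘no/unknown’, so this level mixes confirmed and unconfirmed absence of a loan.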
Yes, as seen previously, the macro socio-economic block of data has a rather unusual distribution. We did not perform any operation on these variables, as we have no experience dealing with this type of data. Nothing seems to point to corrupted or flawed data input.
Regarding the duration variable, we can apply a log transformation to bring its distribution as close to normal as possible. After lots of reading, and still without understanding it completely, a blog post helped us a lot. From it, we understood that the log transformation, besides making the plot look better, can help a future linear regression model perform better.
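A small sketch of the transformation, on made-up durations rather than the real column. The `+ 1` offset is our assumption here: it keeps calls of 0 seconds defined, since `log10(0)` is `-Inf`.

```r
# Made-up call durations in seconds (right-skewed, as call lengths tend to be).
duration <- c(0, 12, 45, 90, 180, 260, 400, 950, 3600)

# log10(x + 1) keeps zero-length calls finite and compresses the long right tail.
log_duration <- log10(duration + 1)

summary(duration)
summary(log_duration)
```

On the raw scale the mean is pulled far above the median by the few very long calls; on the log scale the two are much closer, which is what ‘more normal’ looks like in practice.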
Now that we have a good understanding of each variable on its own, let’s start building relationships. Our investigation will focus on understanding which variables are significantly correlated with the ‘y’ variable, which indicates success in the marketing campaign.
Afterwards, we will look for relationships outside of the y scope.
The age variable does seem to show a tendency with respect to the y variable. However, when looking at both plots, one above the other on the same x axis, we can see that changes in the proportion of y (success/failure) also occur where the frequency count of the age bins changes (bin width = 1 year, i.e. ~100 observations for age 16, ~1750 observations for age 35).
The more observations in a given bin, the lower the proportion of success (y = ‘yes’).
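To make the comparison concrete, the count and the success rate per age bin can be put side by side. This is a base R sketch on invented rows; the data frame `bank` and the result `by_age` are illustrative names, not objects from the original analysis.

```r
# Toy data: age in years and campaign outcome (values invented for illustration).
bank <- data.frame(
  age = c(18, 18, 35, 35, 35, 35, 35, 35, 70, 70),
  y   = c("yes", "no", "no", "no", "no", "no", "no", "yes", "yes", "yes")
)

# Per age bin (width = 1 year): number of observations and number of successes.
n       <- tapply(bank$y, bank$age, length)
success <- tapply(bank$y == "yes", bank$age, sum)

by_age <- data.frame(
  age          = as.numeric(names(n)),
  n            = as.vector(n),
  success_rate = as.vector(success / n)
)

by_age
```

Reading `n` and `success_rate` in the same table makes it easier to judge whether a high rate comes from a well-populated bin or from a handful of observations.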
Let’s plot it as a classic box plot.
Nothing catches our attention, except a larger IQR in the ‘yes’ (success) observations. Let’s use the GGally library to plot the relations between variables, always trying to relate them to the output variable y.
ggpairs is no magic: after creating the matrices, little is really uncovered. Maybe this is because of the categorical nature of the y variable? We purposely reduced the possible combinations to avoid an unreadable 21x21 matrix. Let’s dig deeper into the variables that showed significant differences for the y (success/failure) variable.
We now want to dig deeper into the “campaign” variable. Let’s recall its meaning:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
Intuitively, we would expect that more calls increase the chance of success, as there is more marketing time to convince the client. Let’s see.
We do observe a correlation, but a negative one: the more contacts (calls) a client receives, the less likely they are to convert (y = ‘yes’). We deduce that taking the number of calls as a proxy for ‘time spent in contact with marketing agents’ may not be accurate.
Let’s test with the actual ‘duration’ variable. This one should give interesting insights, as we are warned:
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=“no”). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
We find a clear tendency: success rates improve as the duration of the call increases. As we were warned, this insight is of little use as is, because duration is only known AFTER the call has ended, so no prediction can be based on this parameter alone.
Now let’s check whether there is a relation between having a loan and subscribing to a term deposit: hasloan vs y. Remember, hasloan is the variable created in the univariate analysis. It combines the information of:
6 - housing: has housing loan? (categorical: “no”,“yes”,“unknown”)
7 - loan: has personal loan? (categorical: “no”,“yes”,“unknown”)
It seems not. Let’s show the percentages:
## Source: local data frame [4 x 4]
## Groups: hasloan [2]
##
## hasloan y n rel.freq
## <ord> <ord> <int> <chr>
## 1 yes yes 2781 11.5%
## 2 yes no 21352 88.5%
## 3 no/unknown yes 1859 10.9%
## 4 no/unknown no 15196 89.1%
Indeed, we can see no noticeable difference in y between the two hasloan groups (11.5% vs 10.9% success).
Since we have the information, we will check the relationship between job title and education.
This sample population behaves as we would expect:
How does age affect job category?
We wanted to see whether marital status has an impact on having a housing loan (filter(housing != ‘unknown’)).
It seems not.
We want to unveil data about the observations with default = yes. Let’s find out how much of the entire population is in that condition.
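The percentages can be re-derived from the four counts reported in the table above. The sketch below uses base R's `prop.table` on those counts; the matrix layout is our own choice, not the original code.

```r
# Counts copied from the summary table above (rows: hasloan, columns: y).
counts <- matrix(
  c(2781, 21352,
    1859, 15196),
  nrow = 2, byrow = TRUE,
  dimnames = list(hasloan = c("yes", "no/unknown"), y = c("yes", "no"))
)

# Row-wise proportions: share of each outcome within each hasloan group.
rel_freq <- prop.table(counts, margin = 1)

round(100 * rel_freq, 1)
```

Using `margin = 1` normalises within each hasloan group, which is what the comparison needs: the two groups differ in size (24133 vs 17055 clients), so raw counts would be misleading.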
## [1] "Quantity of observations with variable default = yes : 3"
The population being too small, we cannot provide any insights about the defaulted population.
We tried to explore various relationships with the success/failure variable.
age vs y: we found that the success rate changes with age; however, the observation count also changes with age. Further investigation is needed to test how the number of observations influences the success/failure ‘y’ variable.
Later, a box plot of age vs y did not show differences in the descriptive statistics of the two groups (success/failure).
Then we tried a ggpairs matrix, which hinted that we should dig deeper into the macro socio-economic variables.
‘Macro social and economic Portuguese context data’ vs y, e.g. euribor3m: the box plot unveils differences in the medians of the groups (although the IQRs for success/failure share 80% of the euribor3m values). Indeed, later plotting the proportion for each euribor3m level (bin width = 1) in a histogram shows clear evidence of a greater success rate at the lower euribor3m levels (25% for the lower values, 5% for the higher values). Plotting the other 4 macro social and economic variables yields similar findings.
Further analysis with an expert in those variables is due.
Then we compared campaign vs y, with the hypothesis that more contacts/calls would indicate higher chances of success (more time to be convinced, with more arguments). We found that the opposite is true, with better conversion rates in the first calls. Further investigation should be done to understand whether the number of contacts is a good proxy for more contact with marketing agents (e.g. the process could be ‘only generate a new contact if the previous one did not lead to an actual conversation with the client’).
Trying the actual duration variable, we observe a clear correlation between duration and y success. However, this finding is not exploitable as is, because the value of duration is known only after the call has ended (and with it its success/failure status).
As we did not find any strong correlation between the success/failure variable and the other variables, we started looking for relationships between other pairs of variables. We found that this sample population behaves as we would expect in the job vs education and job vs age analyses.
We hypothesized a relationship between the housing variable and marital status, and tried to find a correlation. None was found.
Lastly, we wanted to inquire about the defaulted population, but the number of observations was very small (3 out of roughly 41,000).
The strongest relationship was found between the macroeconomic variables and the success/failure variable. Professional input is required because, at this level, we do not understand the logic of the following: the marketing campaign for subscription to a term deposit has better success rates when euribor3m levels are at their lowest (meaning lower interest rates, and thus a lower return on the capital deposited).
There is also the y vs duration relationship, but we are warned that it is not usable as is for predictions.
We will categorize the Euribor3m variable and make use of the buckets with a count above 100 to facet the plot.
##
## (0,1] (1,2] (2,3] (3,4] (4,5] (5,6]
## 3908 9590 0 14 27667 9
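The bucketing above can be sketched with base R's `cut`. The rates below are made up, and the threshold is lowered from 100 to 2 only because the toy vector is tiny; `min_count` and `keep` are illustrative names.

```r
# Made-up Euribor 3-month rates standing in for the real euribor3m column.
euribor3m <- c(0.65, 0.9, 1.3, 1.4, 1.8, 3.2, 4.1, 4.86, 4.96, 5.05)

# Integer-wide buckets, labelled (0,1], (1,2], ... as in the table above.
euribor_bucket <- cut(euribor3m, breaks = 0:6, ordered_result = TRUE)
bucket_counts  <- table(euribor_bucket)
bucket_counts

# Keep only the buckets that clear a minimum count
# (the report uses > 100; 2 is used here for the toy data).
min_count <- 2
keep <- names(bucket_counts)[bucket_counts > min_count]
keep
```

Dropping sparse buckets before faceting avoids panels whose success proportions rest on a handful of observations, such as the (3,4] and (5,6] buckets in the real data.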
We confirm that Euribor3m has an effect on conversion, as presented in the bivariate analysis.
We do not observe any substantial variability based on education.
First iteration
Second iteration :
We want to add the y variable information, so we first need to compute it before it can be plotted:
## Source: local data frame [177 x 6]
## Groups: job, education [90]
##
## job education y n rel.freq_sucess total_obs
## <fctr> <fctr> <ord> <int> <dbl> <int>
## 1 admin. illiterate no 1 100.00 1
## 2 admin. basic.4y yes 10 12.99 77
## 3 admin. basic.4y no 67 87.01 77
## 4 admin. basic.6y yes 8 5.30 151
## 5 admin. basic.6y no 143 94.70 151
## 6 admin. basic.9y yes 42 8.42 499
## 7 admin. basic.9y no 457 91.58 499
## 8 admin. high.school yes 382 11.47 3329
## 9 admin. high.school no 2947 88.53 3329
## 10 admin. professional.course yes 49 13.50 363
## # ... with 167 more rows
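A summary table of this shape could be built roughly as follows. The output above suggests a dplyr grouped data frame; this sketch swaps in base R (`aggregate` and `ave`) so that it runs without extra packages, and it uses invented rows. The name `data_sum` matches the plotting call used later; everything else is illustrative.

```r
# Toy data (invented): job, education and outcome for a handful of clients.
bank <- data.frame(
  job       = c("admin.", "admin.", "admin.", "student", "student", "retired"),
  education = c("high.school", "high.school", "high.school",
                "high.school", "high.school", "basic.4y"),
  y         = c("yes", "no", "no", "yes", "no", "yes")
)

# n: number of observations per job / education / y combination.
data_sum <- aggregate(n ~ job + education + y,
                      data = transform(bank, n = 1), FUN = sum)

# total_obs: observations per job / education pair, repeated on each of its rows.
data_sum$total_obs <- ave(data_sum$n, data_sum$job, data_sum$education, FUN = sum)

# Relative frequency (in %) of each outcome within its job / education pair.
data_sum$rel.freq <- round(100 * data_sum$n / data_sum$total_obs, 2)

data_sum[order(data_sum$job, data_sum$education), ]
```

Keeping `total_obs` next to the percentage is what later allows sparse combinations to be filtered out before plotting.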
Rel.freq (column rel.freq_sucess above) is the proportion of each y outcome within each [job, education] combination. We only need one row of the y variable per combination; luckily, as y is binary, we can keep (and plot) just the ‘yes’ row. The subset that removes the y == ‘no’ rows, along with the ‘illiterate’ level and the combinations with 100 observations or fewer, is as follows:
ggplot(data = subset(data_sum, y == 'yes' & education != 'illiterate' & total_obs > 100),
       aes(job, education))
The color palette shows the %rate for success ( variable y == yes)
The marital status variable does not add information: we can see that the ‘color’ (meaning the success rate) does not change with marital status.
In this section we compared different combinations of variables.
No new combination of variables revealed pivotal information for the analysis, beyond what we already knew from the bivariate section.
The main variables for inferring the success/failure of the marketing campaign are the macroeconomic ones (e.g. Euribor3m) and duration.
We did observe that clients in the job categories [student, retired] have more appetite for the product. This is similar to the findings of our first analysis (age vs y). What is interesting in this plot (variable vs job vs y) is that we can observe similarly sized groups (e.g. [retired + married] vs [entrepreneur + married]), yet the success rate does increase towards the extreme ages (retired and students). This is further evidence that age influences the success of the campaign.
This plot quickly shows the success rate of the y variable at different ages (bin width = 1 year). It was the first plot in which we found a clear behavior. The concave curve could have been shown with another type of geom (maybe geom_polygon), but we find that bars correctly express the binary outcome of the variable: success/fail, 1/0, T/F.
We chose this plot because it is where we started to feel comfortable with the exploration. The ggplot library was starting to become an ally instead of a headache. Here, we decided to convert the Euribor3m variable into an ordered categorical one in order to use it as a facet. We chose ncol = 1 so that all levels share the same x axis. We applied transformations to the x and y variables in order to uncover the findings: both Euribor3m and duration have an effect on success. When fixing duration (thus comparing across the Euribor3m facets), the proportion of success is higher at the first (lowest) level. Furthermore, when fixing Euribor3m (thus looking at any one of the 3 plots above), the proportion of success is higher when the duration of the contact call is longer.
## Source: local data frame [177 x 6]
## Groups: job, education [90]
##
## job education y n rel.freq total_obs
## <fctr> <fctr> <ord> <int> <dbl> <int>
## 1 admin. illiterate no 1 100.00 1
## 2 admin. basic.4y yes 10 12.99 77
## 3 admin. basic.4y no 67 87.01 77
## 4 admin. basic.6y yes 8 5.30 151
## 5 admin. basic.6y no 143 94.70 151
## 6 admin. basic.9y yes 42 8.42 499
## 7 admin. basic.9y no 457 91.58 499
## 8 admin. high.school yes 382 11.47 3329
## 9 admin. high.school no 2947 88.53 3329
## 10 admin. professional.course yes 49 13.50 363
## # ... with 167 more rows
In this plot, we managed to produce exactly what we had in mind. We had to create a grouped data frame and then subset it in order to get the rel.freq information. In this plot we show that age (expressed through the jobs student and retired) does have an impact on the success of the campaign: for those 2 job categories, the success rate is higher.
This project was very challenging, on many levels.
Data source: after searching for data in another category (tourism transportation), we realized how scarce clean, ‘ready-to-use’ data sets are. We quickly had to abandon the idea in order to get to work. Kaggle Datasets is an invaluable asset.
Once we chose this marketing data set, the first major difference from Udacity’s lessons was the lack of meaningful continuous variables. All the value is in categorical variables, and the resulting plots lack the ‘cool’ factor that continuous vs continuous plots have.
Then, we found ourselves leaning heavily towards just one variable in our analysis: success/failure. The ‘exploratory’ part of EDA felt more like “how to predict y”. Indeed, this is a machine learning data set, and by nature the exercise is to predict the outcome.
Exploring the data set outside the y variable took MUCH more time than anticipated. We really felt the ‘exploratory’ in EDA in how time-consuming it was, along with the sometimes uncomfortable feeling of going nowhere useful (housing vs marital status) or of showing just obvious concepts (job vs education).
On a positive note, we were intrigued by behaviors observed in the data set (Euribor3m vs y). We conclude that input from professionals in other areas (like macroeconomics in this case) is necessary to unveil more information.
Overall, this project felt like a real challenge and we learned tons from it. Now, machine learning!!!